Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes

Long-read sequencing holds great potential for characterizing complex microbial communities, yet taxonomic profiling tools designed specifically for long reads remain lacking. We introduce Melon, a novel marker-based taxonomic profiler that capitalizes on the unique attributes of long reads. Melon employs a two-stage classification scheme to reduce computational time and is equipped with an expectation-maximization-based post-correction module to handle ambiguous reads. Melon achieves superior performance compared to existing tools in both mock and simulated samples. Using wastewater metagenomic samples, we demonstrate the applicability of Melon by showing it provides reliable estimates of overall genome copies, and species-level taxonomic profiles. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-024-03363-y.

long reads.In the experiments, the reviewer also expects to see the performance of Melon w.r.t.some data properties of TGS.For example, how does Melon's performance change with the read length, error rate, and coverage?Many TGS data may not be able to achieve a very high coverage.Seems these key factors are not evaluated.
How important is the EM step?Seems its contribution is not evaluated in the Results Section.It is only first mentioned in Discussion.
The chosen benchmark tools are not designed for long reads.Kraken is not specifically designed for long reads.Also, as the authors pointed out, BWA in mOTU is also not optimized for long reads.Thus, the comparison is not very fair.Although the authors mentioned that MetaMaps, MEGAN-LR, and BugSeq are "sequence-based", can they be customized with different reference database?If so, you can replace their reference data with your protein database?Additionally, you can also directly run Melon against these tools and compare their final outputs.A minor comment is related to the names of the two methods.After all, marker genes are still sequences.So using "sequence-based" to exclude marker-based tools can be confusing.Maybe "genome-based" is a better name.
Method 5.1.1It will be clearer to describe the change of the number of sequences w.r.t. the multiple "filtration" strategies.In particular, it is not clear what is the purpose of the second alignment with full KOfam database.Also, is the scaled cutoff applied to both the 91 and 11,298 pHMMs or just the latter?The described procedure is hard to reproduce.
Overall, I found the descriptions at several places are hard to follow partly because of the scattered description of the major steps in different places and partly because the description is like a laundry list lacking justification.Two examples are given below.The second paragraph of Section 2.1 is a little hard to follow.The authors said "four factors are considered".But there lacks clear description of how to reduce 91 pHMMs to 8 pHMMs and how each factor is used specifically.Later, seems '8' refers to only 8 marker gene not 8 gene clusters (pHMMs) in 5.1.3.Another example.Up to Page 10, it is still not clear how the authors identify reads that cover marker genes.The authors mentioned "Melon first extracts reads that cover at least one marker gene using a protein database".Fig. 1 said "Aligning sequences of assemblies to the marker gene database".Neither of "extract" nor "align" is very specific.In 5.1.5,the authors mentioned the usage of "DIAMOND" for aligning RefSeq assemblies with the protein database.Better highlight the used method/tool for the major steps when those steps are introduced.
There are a large number of cutoffs/parameters.Is there a better way to present them and highlight the justifications?Minor: Does the last sentence mean that the authors applied the 8 pHMMs to all nr and env-nr proteins and kept the aligned proteins?Reading the detailed methods can help understand these descriptions.But reading the paper following the given order can be a little confusing.
Melon is a computational tool.Thus, Fig. 1 a. is a little confusing as it contains sampling, DNA extraction, sequencing.
2.3.2 please specify the tool you used to simulate TGS data at its first mention.
Reply to the review report on GBIO-D-23-01745: Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes June 4, 2024 Dear Editor, Thank you for your time and effort in handling our manuscript.Enclosed please find a point-by-point response to the reviewers' comments and the revised manuscript entitled "Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes" for consideration of publication in Genome Biology.

Reviewer 1
In this study, the authors bridge a significant gap between metagenomic sequencing and data processing by developing 'Melon', a novel marker-based taxonomic profiler tailored for long-read sequencing data.This innovative tool represents a significant advancement over existing methodologies, such as mOTUs3, Kraken2, and Bracken, particularly in its native ability to process long-read sequences without the need for preprocessing steps like fragmenting into short reads.The comprehensive evaluation conducted by the authors, including both in vitro mock and in silico simulated samples, showcases Melon's superior performance in species identification and abundance estimation.The paper is well-written and employs appropriate methods for simulation and evaluation.It will undoubtedly capture the interest of researchers in microbiome and environmental microbiology fields, offering a valuable tool for the accurate and efficient analysis of long-read sequences.Moreover, the significance of this work extends to public health, particularly in its potential application for tracking antibiotic resistance genes (ARGs) in complex environments such as wastewater.
Before proceeding, I would recommend the authors address the following questions.

Author response:
Thank you for the comments.We have expanded the evaluation of marker genes by including average nucleotide identity comparisons and have used a human fecal dataset to demonstrate the validity of ARG quantification.Please find our point-by-point responses below.

Comment 1
Line 38: "marker" in the so-called marker-based profiler still refers to a sequence, so I suggest retaining the original terminology from the paper to avoid confusion, such as using DNA-to-DNA/Protein or DNA-to-Marker method.Please ensure this consistency is maintained throughout the manuscript.

Comment 2
Line 76: why is this approach considered inappropriate, and how does it impact the results?The manuscript would benefit from a more detailed explanation.

Reply:
Cutting long reads into artificial short reads removes all long-range information (Portik et al., 2022).Given that most marker genes are longer than 300 bp, these short reads may not be able to cover the full range of the genes, leading to ambiguous alignments and, as a consequence, lower taxonomic resolution.In Fig. 3a, we see that mOTUs clearly underestimated the number of total genome copies for the mock samples (S1-3 and G1-3).
Although this large discrepancy might be caused by the inappropriate choice of alignment settings and tools (i.e., BWA with presets for Illumina reads), we noticed that, on average 5.35% of the estimated copies were not assigned with species-level taxonomic IDs.Therefore, we believe this strategy of cutting long reads into short reads is more of a compromise and not well-suited for long-read metagenomics.Melon (v0.1.0, NCBI database, ver. 2023-07-31), Kraken (v2.1.3, Standard database, ver. 2023-06-05), Bracken (v2.8, same database as Kraken) and mOTUs (v3.1.0,default database, with modifications).Shapes of points denote different mock communities, D6300 and D6331.Tables provide summarized metrics

Comment 3
Line 112: if F 0.5 is the weighted harmonic mean of precision and recall, this should be explained at its first mention, along with clarifications on F 1 and x/y ratios.

Comment 4
Line 123: the feasibility of aligning 0.9 million sequences on a standard laptop is unclear without benchmarking for the computing resources required.Although this represents a significant reduction in the number of reads, it's still a substantial volume.How large is the database? Reply: We recorded the run time and peak memory usage of Melon on the influent and effluent samples (sequenced with Nanopore R10.4.1).For a 5 Gb effluent sample, Melon was able to finished in 23 minutes on an M1 Max chip MacBook Pro (profiling against 310,881 RefSeq assemblies).The full resource requirements are shown in Supplementary Table S1.
The database after compression is roughly 2.9 GB for RefSeq R219 (https://zenodo.org/records/8418111) and 4.3 GB for GTDB R214 (https://zenodo.org/records/8418135).In Fig. 2d, we see that almost all RefSeq complete genome assemblies possessed exactly eight marker genes.Moreover, for both bacteria and archaea, assemblies with more than eight marker genes tended to have high contamination, whereas assemblies with fewer than eight marker genes low completeness (predicted by CheckM2).

Comment 6
Line 163, I believe this is critical in determining the resolution of Melon.Please add the distribution of the similarity of selected markers among different species/strains, which will give readers a clearer picture of the accuracy and potential of these marker genes.

Reply:
We thank the reviewer for pointing this out.The within-species/between-strains similarity of marker-gene-containing sequences is indeed crucial in determining the taxonomic resolution of Melon.This was overlooked in our previous assessments of marker genes.We have now included a new experiment to compare the similarity of marker-gene-containing sequences at strain or species levels.
Briefly, we computed the pairwise average nucleotide identity (ANI) between all marker-genecontaining sequences (at most 10,000 bp) using skani v0.2.1 (skani dist) (Shaw et al., 2023).The identity of a pairwise comparison was recorded if the aligned fraction was greater than 75% for both the query and the reference.Archaeal ribosomal protein genes s2, s7, and s9 were not included.
Given that RPGs are in general more conserved compared to other nonessential genes, it was expected that some between-species ANIs exceeded the conventional ANI cutoff for species identification, i.e., 95%.However, for most marker genes, there was still a gap between the within-species and within-genus ANIs.This large difference in species/genus-level ANIs provided evidence that these marker-gene-containing sequences were sufficient for delivering species-level taxonomic resolution.

Comment 7
I am very interested to see how the RPGs assessment goes in GTDB instead of RefSeq, considering that RefSeq has issues with confused taxonomy annotation.Since GTDB relies solely on genome similarity, it could potentially retain more markers, possibly enhancing abundance estimation with a larger pool of marker genes.

Reply:
Using GTDB for assessment is indeed a great idea as it already has a list of species-level representative assemblies, selected not only according to completeness and contamination scores but also to other criteria such as being assembled from type material (Parks et al., 2022).We therefore reassessed the 91 RPGs using all complete genome representative assemblies (n = 7, 523) of the latest GTDB (R220).RPGs satisfying all three criteria (darkest colors in all three rows) for either bacteria or archaea are marked in bold.The bar chart on the right side shows the number of assemblies (species) present in each phylum.Phyla with only a single assembly are grouped under "others".
As the reviewer anticipated, with GTDB, seven more bacterial RPGs could pass the universality and deviance cutoff, including s8, s10, l13, l17, l19, l22, and l24 (Fig. R2).There were no additionally selected RPGs for archaea.In fact, archaeal l18e was no longer selected as it was missed in one of the 13 tiny archaeal genomes (ca.Micrarchaeota archaeon, GCA_021654395.1, excluded by RefSeq due to (1) derived from metagenome and (2) genus undefined) for some reason.These additional RPGs were unfortunately found to be located closely to the previously selected RPGs, and after clustering based on their genomic distances, we still obtained eight clusters.As mentioned in Section 3, collocation is a known issue for RPGs, although they have the advantages of being essential and conserved.In the future, we might consider including genes other than RPGs, to expand the candidate list.Upon closer inspection of the assessments we found that, although with GTDB more RPGs could be retained, the overall universality and deviance tended to be more pessimistic (Fig. R3).This could be caused by the fact that NCBI considers many unclassified species as individual species, even though they have a high ANI with known species (e.g.Escherichia sp., taxonomic ID 1884818).

Comment 8
Line 231: the explanation of Bracken and mOTUs' performance is confusing.Please provide a clearer explanation and define what is meant by the recovery rate.

Reply:
The recovery rate shows the number of genome copies that could be detected by the corresponding profilers and is synonymous with the x/y ratio that we defined previously.We used these two terms interchangeably, with the exception that the recovery rate might refer to species-level genome copies, while the x/y ratio exclusively to total genome copies.

Comment 9
Line 237: what was the rationale behind choosing 500,000 reads for the simulation??

Reply:
The simulator we used (metagenome mode of NanoSim v3.1.0)does not support simulation at a given sample size but requires users to specify a targeted number of reads.Since CAMI2 samples have a fixed number of bases (5 Gb), we chose 500,000 reads to ensure that the maximum size of samples did not deviate too far from 5 Gb.The resulting simulated samples had sizes ranging from 2.3 Gb to 5.3 Gb (depending on the models used for simulation).

Comment 10
Line 254: in which species richness?or are you using the Wilcoxon signed-rank test?

Reply:
Yes, we used the two-sided Wilcoxon signed-rank test to compare the dissimilarity metrics (between ground truth and estimated relative abundances of species) returned by different tools.Here we didn't compare the richness as it can be reflected by the precision-recall plots: Kraken2 tended to have an inflated richness estimation since it had many false positive species, while Melon might sometimes underestimate the richness due to its relatively low recall.

Comment 11
Line 300: It's not just Kraken that faces the problem mentioned; it's a common issue for every metagenomic profiler.The discussion could instead focus on the type of relative abundance generated by DNA-to-DNA methods.

Comment 12
Line 306: it is good to see the application of Melon, but adding some validation, such as using shotgun sequencing data and a well-established pipeline for ARG calculation as a comparison, would greatly enhance its credibility.

Reply:
We agree with the reviewer that the application of Melon lacks justification.We downloaded and reanalysed the 109 Singaporean human faecal samples (paired Nanopore and Illumina data) from PRJEB49168 (Singapore Platinum Metagenomes Project) (Gounot et al., 2022) using both the strategy we presented in this manuscript and ARGs-OAP v3.2.4 (Yin et al., 2022).ARGs-OAP is a short-read-based ARG quantification pipeline developed by our group.
Since ARGs-OAP does not provide species-level taxonomic information, we compared the total ARG abundance (expressed as ARG copies per genome copy) of the samples.As shown in Fig. S7, the estimated ARG abundances were highly correlated (Spearman's correlation ρ = 0.941, permutation test, two-sided, p < 0.001).
Despite being highly correlated, the ARG abundances estimated by long reads were systematically higher due to: (1) the database we used in this manuscript was the full version of the SARG database, which includes additional genes such as transcriptional regulators (e.g., activators and repressors) compared to ARGs-OAP's short version; (2) ARGs-OAP considers two-or three-component systems (e.g., arcA-arcB-tolC, the primary efflux pump of E. coli) as single units.These genes are weighted by a factor of either 1/2 or 1/3, leading to a significant reduction in ARG abundance estimation; and (3) the cutoffs we employed for long reads (identity 75% and subject cover 75%) may not be directly comparable to those used by ARGs-OAP (identity 80% and query cover 85%).
We note that developing a long-read-based ARG quantification pipeline was not our primary goal in this project.Instead, we aimed to demonstrate Melon's ability to provide reliable species-level genome copy estimates, which could be used as normalizing constants for ARGs.
Further evaluation and parameter tuning may be required to achieve more precise ARG abundance estimation.

Comment 13
Additionally, given that the primary difference between Melon and mOTUs is the selection of marker genes and the alignment of flanking sequences used in abundance estimation, I see no reason why this method couldn't also be applied to short sequencing reads.Perhaps this could be evaluated in the manuscript or considered for future work.

Reply:
Thank you for this suggestion.The marker gene database we constructed can indeed be applied to short sequencing reads, and the total genome copies estimated should still be accurate given appropriate alignment settings and tools.However, due to length limitations, short reads may not be able to cover sufficient flanking regions and therefore cannot provide detailed species-level genome copies.In fact, we plan to replace the single-copy marker gene database of ARGs_OAP with the newly constructed protein database, but this requires more detailed evaluation.We leave this for future work.genome copy estimation.If most marker genes are located near the replication origin, the estimated genome copies could be overestimated for fast-growing species.

Coverage
Relative genomic distance to oriC c. estimations.Black dots represent sequencing coverage at different genomic locations, smoothed using 5,000 bp non-overlapping windows.Colored dots display the genomic locations of marker genes in four sets: Melon (8 genes), GTDB (120 genes), mOTUs (10 genes), and MicrobeCensus (30 genes).Dashed black lines denote expected genome copies, whereas solid colored lines stand for estimated genome copies produced by averaging the coverage of marker genes in different sets, smoothed again using 5,000 bp windows.Melon's genome copy estimates are the closest to the expected genome copies of both Bacillus amyloliquefaciens (B.amyloliquefaciens, Gram-positive) and Pseudomonas alloputida (P.alloputida, Gram-negative), owning to its relatively balanced marker gene distributions.Log 2 -transformed peak-to-trough ratios (PTRs) are displayed in text.OriC means the origin of replication.

Comment 2
As long reads can contain insertion/deletion sequencing errors, I am wondering whether the authors did any error correction, which can affect the gene prediction and translation.
Or is DIAMOND under some setting is very robust against sequencing errors?Did the authors consider using optimized tools for long read alignment? Reply: Yes, as long reads can have a large proportion of insertion/deletion errors the genes predicted (e.g., via Prodigal) from these reads may not be trustworthy.We didn't correct or assemble the reads but employed DIAMOND's long-read mode blastx to do the translated search directly.
This long-read mode was introduced to DIAMOND in version 0.9.23 and, according to the developers, is capable of conducting frame-shift aware DNA-to-protein alignment (Arumugam et al., 2019).This mode is also used by MEGAN-LR to handle error-prone long reads.
We admit that we didn't look into the detailed parameter settings of this mode, as we empirically found the default parameters worked well for all the mock samples, which had a wide range of quality scores.As is shown in Fig. 3a, the total genome copies were obtained directly from DIAMOND's frame-shift aware alignment and were all close to the expected values.

Comment 3
Until Section 5.1.5, it is not obvious which part of the methodology design is optimized for long reads.In the experiments, the reviewer also expects to see the performance of Melon w.r.t.some data properties of TGS.For example, how does Melon's performance change with the read length, error rate, and coverage?Many TGS data may not be able to achieve a very high coverage.Seems these key factors are not evaluated.

Reply:
The major difference between Melon and other marker-based methods (mOTUs, MetaPhlAn) is that Melon's taxonomic labelling is based on marker genes plus their flanking regions, rather than the marker genes themselves.Although this extension is straightforward, it allows Melon to take advantage of long reads and give more detailed species-level taxonomic labels with high accuracy.In Fig. S2, we see that the flanking regions were indeed important and there was an obvious improvement in precision and recall when the sequence length increased from 5,000 to 10,000 bp.This implies that using marker genes alone (which are typically below 1,000 bp) is not adequate for long-read taxonomic classification.
5,000 bp 10,000 bp 15,000 bp precision recall  Models used for simulation were trained using six mock samples (S1-3 and G1-3).Each combination of simulation models and numbers of species contains ten randomly generated profiles.Colors represent the mean values of these profiles.

Mean value
In that sense, Melon can also be seen as a DNA-to-DNA profiler with a specialized database.
First, the incorporation of flanking regions allows Melon to go beyond DNA-to-marker profiling and fully utilize the long-range information of long reads.Second, the single-copy and universal nature of marker genes enables Melon to directly yield taxonomic abundance rather than sequence abundance (which requires genome-size correction to get taxonomic abundance).Third, the relatively small size of the database permits Melon to profile against the entire collection of GTDB, which currently includes around 600k assemblies (R220), using base-level alignment on a standard laptop.In comparison, profiling against the entire GTDB with Kraken may require terabytes of memory, which is not considered computationally feasible for most users.
The properties of TGS, including coverage, quality score q, and read length l, are indeed important for evaluating long-read profilers.However, unlike Illumina simulators, TGS simulation tools such as NanoSim cannot directly simulate a series of samples with various q and l combinations but require error profiles to generate simulations.This is the main reason we collected six mock samples with a wide range of read characteristics (mean q 10.9-19.0,mean l 4.1k-10.4k,Supplementary Table S3).We trained six different models using these mocks to mimic their read characteristics and used them for simulation instead.In the complexity experiment, we changed the number of species (32-512) and fixed the total number of reads to achieve different coverages.We saw that Melon did not work well when the number of species was high (and in turn, the coverage of species was low), but this is a common limitation for marker-based methods.a Gigabase pair (10 9 base pairs).
b Two spike-ins species Allobacillus halotolerans and Imtechella halotolerans were removed before calculation.
Nevertheless, to demonstrate the robustness of Melon with respect to q and l, we cut mock G1 into 25 bins.As shown in Fig. R4, Melon performed equally well for all q and l, except for q > 20, where the total number of bases ranged from 1.7 Mb to 17.4 Mb.Such low coverage hindered EM from correcting false positive reads, leading to relatively low precision.
Recall was also miserably low due to insufficient coverage.Both x/y ratio and Bray-Curtis were more varied, indicating again the poor performance of Melon at low coverage.
Figure R4 for reviewer 2: Performance of Melon with different q and l.

Comment 4
How important is the EM step?Seems its contribution is not evaluated in the Results Section.It is only first mentioned in Discussion.

Reply:
EM helps to eliminate false positive species and is important for achieving high precision.
Although the EM step may skew the relative abundance estimates if multiple highly similar species are present, we believe in reality this negative effect is small (see Comment 5).

Comment 5
The chosen benchmark tools are not designed for long reads.Kraken is not specifically designed for long reads.Also, as the authors pointed out, BWA in mOTU is also not optimized for long reads.Thus, the comparison is not very fair.Although the authors mentioned that MetaMaps, MEGAN-LR, and BugSeq are "sequence-based", can they be customized with different reference database?If so, you can replace their reference data with your protein database?Additionally, you can also directly run Melon against these tools and compare their final outputs.
except for Centrifuger.This suggests that these classifiers are less sensitive to quality score q but more sensitive to read length l.We also see that the additional EM step (minimap2+EM) greatly reduced the number of false-positive misclassifications but might accidentally skew the relative abundance estimate by introducing true-positive misclassifications due to the presence of highly similar species (e.g., Neisseria meningitidis being classified as Neisseria gonorrhoeae, ANI 94.94%).This experiment indicated the advantages of using base-level alignment plus post-error correction for accurate taxonomic labelling of long reads.
We didn't add MetaMaps and MEGAN-LR to the main text by rerunning the mock/simulation experiments due to resource limitations.We note that although MetaMaps uses approximate rather than base-level mapping, for the marker-gene-containing sequences that we simulated (approximately 1 Gb), it took over a day to finish on a lab-scale server (40-thread CPU, 500 GB memory), possibly due to poor implementation.MEGAN-LR by default requires mapping against NCBI nr or nt, which is also computationally costly.misassembly; for instance, protein WP_109570822.1 is fused by a ribosomal protein s2 and a translation elongation factor).These protein sequences cannot be detected using only the 91 RPG profiles and are generally favored by alignment tools due to their longer lengths.
We want to minimize the effects of these proteins but acknowledge that the number of these proteins is rather small (n = 38, 568) and their effects may be negligible.
The scaled cutoff was applied to both the 91 and the 11,298 PHMMs.We justified the choice of 0.75 in Supplementary Table S2.

Comment 8
Overall, I found the descriptions at several places are hard to follow partly because of the scattered description of the major steps in different places and partly because the description is like a laundry list lacking justification.Two examples are given below.
The second paragraph of Section 2.1 is a little hard to follow.The authors said "four factors are considered".But there lacks clear description of how to reduce 91 pHMMs to 8 pHMMs and how each factor is used specifically.Later, seems '8' refers to only 8 marker gene not 8 gene clusters (pHMMs) in 5.1.3.

Reply:
We apologize for the lack of clarity.Our intention was to provide an introductory overview of Melon and its database (i.e., description of Fig. 1) in Section 2.1 before moving on to marker-gene selection in Section 2.2.Therefore, it was inevitable to mention the "four factors" in both Sections 2.1 and 2.2, even though the detailed procedure for reducing the 91 PHMMs to 8 PHMM clusters using these factors and selecting marker genes from each cluster was described only in Section 2.2.We agree that this might cause confusion when reading in the current order.To address this, we have added a reminder to guide readers more clearly.

Changes:
We have added a reminder before the "four factors" in Section 2.1: A subset of RPGs were screened as marker genes (eight each for both bacteria and archaea, :::: see ::::::: section ::::::::: "Quality ::::::::::: assessment :: of ::::::::: PHMMs :::: and ::::::: RPGs" ::: for ::::: more ::::::: details) by assessing the quality of each according to four distinct factors: (1) F 0.5 -scores of ::::::::: (weighted :::::::::: harmonic ::::: mean ::: of ::::::::: precision :::: and :::::: recall) ::: of their associated PHMMs, (2) their prevalence among species, (3) discrepancies between their average numbers of copies and the ideal count of one, and (4) their mean relative genomic distances to other candidate RPGs (Fig. 1c).sample, Melon first extracts reads that cover at least one marker gene using a protein database (DIAMOND), and then profiles the taxonomy of these marker-containing reads using a nucleotide database (minimap2).The main output of Melon is a tab-delimited table listing the estimates of species' genome copies and relative abundances.Grey dashed arrows and text indicate necessary sample preprocessing steps to obtain metagenomic long reads.b.The construction of the protein database is initiated by re-annotating NCBI protein sequences using hmmsearch and a set of RPGrelated PHHMs.c.A subset of RPGs are selected as marker genes based on their universality, deviance, F 0.5 -scores, and mean relative genomic distances.d.The nucleotide database is built by extracting 10,000 bp genomic regions, which encompass marker genes and their adjacent flanking regions, from 310,881 RefSeq assemblies using DIAMOND and SeqKit.These assemblies represent 44,057 bacterial and 1,016 archaeal species.a-d.Solid colored dots stand for marker genes, while semi-transparent grey dots represent other genes.Circular elements denote genomes, while linear elements signify reads.

Comment 10
There are a large number of cutoffs/parameters.Is there a better way to present them and highlight the justifications? Reply: Thank you for pointing this out.We agree that a large number of cutoffs and parameters were mentioned in this manuscript.Here is a brief summary of all the major cutoffs and parameters we considered (Table R1):

Comment 12
Melon is a computational tool.Thus, Fig. 1 a. is a little confusing as it contains sampling, DNA extraction, sequencing.

Reply:
We thank the reviewer for noting this.We agree that these words in Fig. 1a can cause confusion.However, removing the upper-left two subfigures would require reformatting the layout of the entire figure.To mitigate this, we have changed the arrows of these necessary sample preprocessing steps from "solid" to "dashed" and the annotations from "black" to "grey".The caption has also been updated for clarity (see Comment 9).

Comment 13
2.3.2 please specify the tool you used to simulate TGS data at its first mention.
Figure R2for reviewer 1: Selection of marker genes (GTDB).Mean copies, universality and overall performance of RPGs.Sizes of white dots represent the values of universality, with universality of one being omitted.Overall universality, deviance and F 0.5 -scores of RPGs are shown below the heatmaps of mean copies.Shapes of triangles indicate evaluation sets, with top-right triangles representing the full set of assemblies, and bottom-right triangles the tiny genome subset.RPGs satisfying all three criteria (darkest colors in all three rows) for either bacteria or archaea are marked in bold.The bar chart on the right side shows the number of assemblies (species) present in each phylum.Phyla with only a single assembly are grouped under "others".

Figure
Figure R3for reviewer 1: RPG assessments with NCBI and GTDB.
Figure S7: ARG abundances estimated using short and long reads.Colors and shapes indicate subjects' age and gender, respectively.

Figure 2c :
Figure 2c: Selection of marker genes.Effects of marker genes' locations on genome copy

Figure S2 :
Figure S2: Performance of Melon at different length cutoffs of flanking regions.

Figure S6 :
Figure S6: Comparison of taxonomic assignment strategies.a. Species-level misclassification.Color indicates the percentage of misclassified reads (Mis.%).Wild-type E. coli is labelled with an asterisk.b.Overall misclassification.Color represents types of misclassification.Characteristics of simulation profiles are shown in tables.

Table S1 :
Computational time and peak memory usage for wastewater samples a Excluding multidrug ARGs.b Not tested with MacBook Pro due to insufficient memory.c

Table S3 :
Statistics of mock and wastewater samples

Table S2 :
Performance of PHMMs at different scales of threshold scores scale precision recall F 0.5 -score F 1 -score RPGF a a Number of RPGFs with PHMM's F 0.5 -scores greater than 0.99.
Table R1 for reviewer 2: Summary of parameters and cutoffs a D: Default.T: Tunable.J: Justified.E: Empirical.